setwd("D:/Outline/Webinar-1")
Every story has characters, and a few of them are main characters.
At times we do not know who the main characters are, but we would like to find out.
Another interesting aspect is the relationship among characters: its nature and its strength.
The nature of a relationship can be described as positive or negative, direct or indirect, and so on.
Finally, the most crucial ingredient is time: both the nature and the strength of a relationship change with it.
Another job of visualisation is to generate or hold interest, so at times we will use colours for that.
Types:
Based on number of variables / characters (univariate / multivariate)
Static or interactive
In this brief presentation we shall get a glimpse of each of them.
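To make the static versus interactive distinction concrete, here is a minimal sketch on a small made-up data frame. The `toy` data and its column names are assumptions for illustration, not the webinar data, and the sketch assumes the ggplot2 and plotly packages are installed.

```r
library(ggplot2)
library(plotly)

# Made-up data for illustration only (not the webinar data)
toy <- data.frame(
  group = c("a", "a", "b", "c", "c", "c"),
  kind  = c("x", "y", "x", "x", "y", "y")
)

# Univariate, static: a single variable on the x axis
p_static <- ggplot(toy, aes(x = group)) + geom_bar()

# Multivariate, static: a second variable mapped to the fill colour
p_multi <- ggplot(toy, aes(x = group)) + geom_bar(aes(fill = kind))

# Interactive: the same chart converted so that hovering shows tooltips
p_inter <- ggplotly(p_multi)
```

Printing `p_static` or `p_multi` draws a fixed image, while printing `p_inter` opens a widget that responds to the cursor.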
When we develop stories from data, “purpose” is of critical importance.
Is it purely an exploratory exercise?
Are we explaining a relationship and how it is likely to pan out in the future?
Are we going to explain a concept, more specifically a statistical concept?
In the following sections, we will start with a very simple dataset and add complexity to it.
We begin with a very small exercise.
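# Packages used throughout: dplyr (%>%, group_by), ggplot2 (static plots), plotly (interactive plots)
library(dplyr)
library(ggplot2)
library(plotly)
# stringsAsFactors = TRUE reproduces the Factor columns shown below (the default became FALSE in R 4.0)
income <- read.csv("income.csv", stringsAsFactors = TRUE)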
str(income)
## 'data.frame': 1192 obs. of 6 variables:
## $ earn : num 50000 60000 30000 50000 51000 9000 29000 32000 2000 27000 ...
## $ height: num 74.4 65.5 63.6 63.1 63.4 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 1 1 1 1 2 2 2 ...
## $ ed : int 16 16 16 16 17 15 12 17 15 12 ...
## $ age : int 45 58 29 91 39 26 49 46 21 26 ...
## $ race : Factor w/ 4 levels "black","hispanic",..: 4 4 4 3 4 4 4 4 2 4 ...
summary(income)
## earn height sex ed age
## Min. : 200 Min. :57.50 female:687 Min. : 3.0 Min. :18.00
## 1st Qu.: 10000 1st Qu.:64.01 male :505 1st Qu.:12.0 1st Qu.:29.00
## Median : 20000 Median :66.45 Median :13.0 Median :38.00
## Mean : 23155 Mean :66.92 Mean :13.5 Mean :41.38
## 3rd Qu.: 30000 3rd Qu.:69.85 3rd Qu.:16.0 3rd Qu.:51.00
## Max. :200000 Max. :77.05 Max. :18.0 Max. :91.00
## race
## black :112
## hispanic: 66
## other : 25
## white :989
##
##
So we have 6 characters (variables), which are self-explanatory, except that ed refers to “education”.
We are considering “income” as the phenomenon under study. Therefore, the “height” variable is not considered.
ggplot(data = income, aes(x = race)) + geom_bar(aes(fill = race))
ggplot(data = income, aes(x = sex)) + geom_bar(aes(fill = sex))
ggplot(data = income, aes(x = sex)) + geom_bar(aes(fill = sex)) + facet_wrap(~race)
All three figures describe the composition of the data in terms of race and gender. The observations are as follows:
There is an overwhelming presence of “white” people.
There are more females than males.
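The racial groups are far from equal in size (989 “white” against 25 “other”), so the panels of the third plot are best compared by the female-to-male split within each race rather than by bar height.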
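The composition can also be checked numerically. The sketch below takes the counts printed by `summary(income)` above and converts them to percentage shares using base R only. The variable names are illustrative.

```r
# Counts copied from the summary(income) output above
race_counts <- c(black = 112, hispanic = 66, other = 25, white = 989)
sex_counts  <- c(female = 687, male = 505)

# Both sets of counts should add up to the 1192 observations
total_obs <- sum(race_counts)

# Percentage share of each category, rounded to one decimal place
race_share <- round(100 * prop.table(race_counts), 1)
sex_share  <- round(100 * prop.table(sex_counts), 1)

print(race_share)
print(sex_share)
```

On the full data frame, `prop.table(table(income$race))` would give the same shares without typing the counts by hand.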
Please note that all the diagrams so far are static, not interactive. In the next section we will look at the demographic profile and also experience “interactive” charts.
# First example of interactive plot
income %>% plot_ly(x = ~race) %>% add_histogram(color=~sex) %>% group_by(race, sex) %>%
summarise(n = n())
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
# Examining demographic profiles
table1 <- income %>% group_by(race,sex) %>% summarize(avg.income = mean(earn), Edu_Years = mean(ed),Nos = n())
colnames(table1) <- c("Race","Gender", "Avg.Income", "Avg.Education","Nos")
# Examining the educational profile by race and gender
plot_ly(income, y = ~ed, color = ~race, type = "box")
plot_ly(table1, x = ~Race, y = ~Avg.Education, color = ~Race, type = "bar")
plot_ly(table1, x = ~Race, y = ~Avg.Education, color = ~Gender, type = "bar")
The first plot is an interactive one: once you place your cursor on the diagram, it shows information pertaining to that area.
In terms of education, “others” are more educated than the other races.
In terms of the average number of years of education, females belonging to the “black” and “other” races are more educated than males.
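The bar charts above are drawn from per-group averages. As a minimal sketch of how such a table is built, the code below groups a small made-up data frame by race and gender and averages income and education. The `toy` values are invented for illustration and do not come from income.csv. Naming the columns inside `group_by()` and `summarise()` is one alternative to renaming them afterwards, and the `.groups` argument assumes dplyr 1.0 or later.

```r
library(dplyr)

# Made-up rows for illustration only (not taken from income.csv)
toy <- data.frame(
  earn = c(20000, 30000, 40000, 10000),
  ed   = c(12, 16, 14, 10),
  sex  = c("female", "female", "male", "male"),
  race = c("white", "white", "white", "black")
)

# One row per race/gender combination with group averages and counts
toy_table <- toy %>%
  group_by(Race = race, Gender = sex) %>%
  summarise(
    Avg.Income    = mean(earn),
    Avg.Education = mean(ed),
    Nos           = n(),
    .groups = "drop"
  )

print(toy_table)
```

Each row of `toy_table` corresponds to one bar in a grouped bar chart, which makes it easy to read the exact averages behind a visual comparison.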
In the next section - we will explore the income profile.
plot_ly(data = income, x = ~race, y = ~earn, color = ~race, type = "box")
plot_ly(data = income, x = ~sex, y = ~earn, color = ~sex, type = "box")
plot_ly(data = table1, x = ~Race, y = ~Avg.Income, color = ~Race, type = "bar")
plot_ly(data = table1, x = ~Gender, y = ~Avg.Income, color = ~Gender, type = "bar")
plot_ly(data = table1, x = ~Race, y = ~Avg.Income, color = ~Gender, size = ~Avg.Income, type = "bar")
In terms of income disparity, the following can be noted:
Within “whites” there is a section of people whose income is significantly greater than that of others.
White people have the highest average income and Hispanic people the lowest.
Many males earn more than females.
The average income of females is greater than that of males ONLY for Hispanics. For the rest, the average income of males is greater.
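The high earners show up as isolated points in the box plots. Box plots conventionally flag values lying more than 1.5 times the interquartile range beyond the quartiles, and the sketch below applies that rule to the overall quartiles of `earn` printed by `summary(income)`. It assumes the plotting library follows the same convention and computes quartiles in the same way, which may not hold exactly. Per-race fences would differ.

```r
# Quartiles and maximum of earn, copied from the summary(income) output above
q1       <- 10000
q3       <- 30000
max_earn <- 200000

# Interquartile range and the conventional 1.5 * IQR fences
iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

# TRUE when at least the maximum lies above the upper fence
has_high_outliers <- max_earn > upper_fence

print(c(lower = lower_fence, upper = upper_fence))
print(has_high_outliers)
```

On the full data, `boxplot.stats(income$earn)$out` would list the individual values that base R treats as outliers under a similar rule.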
In the following section we shall take a unified view, including education and income.
ggplot(data = income, aes(x = ed, y = earn)) + geom_point(aes(color = sex)) + facet_wrap(~race) + geom_smooth(method = "lm")
ggplot(data = income, aes(x = ed, y = earn)) + geom_point(aes(color = sex, size = ed)) + facet_wrap(~sex) + geom_smooth(method = "lm")
coplot(earn ~ ed | race*sex, data = income, panel = panel.smooth)
plot_ly(data = income, x = ~ed, y = ~earn, type = "scatter", mode = "markers", color = ~sex, frame = ~cut(ed, 10), marker = list(size = 5))
Points to be noted:
Could we see traces of discrimination among races?
Could we see traces of discrimination between genders?
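One possible way to put a number on these questions is to fit the same kind of straight line that `geom_smooth(method = "lm")` draws, with gender added as a predictor. The sketch below does this on a small made-up data frame in which earnings are constructed to rise by 2000 per year of education, with a fixed gap of 5000 between the two groups. The `toy` data and these values are assumptions for illustration, not estimates from income.csv.

```r
# Made-up data for illustration only: four education levels per gender
toy <- data.frame(
  ed  = rep(c(10, 12, 14, 16), times = 2),
  sex = rep(c("female", "male"), each = 4)
)

# Earnings built by construction: 2000 per year of education, plus 5000 for "male"
toy$earn <- 2000 * toy$ed + ifelse(toy$sex == "male", 5000, 0)

# Linear model: earnings explained by education and gender
fit <- lm(earn ~ ed + sex, data = toy)

# "ed" is the slope per year of education,
# "sexmale" is the gap between genders at equal education
print(coef(fit))
```

A non-zero gender coefficient only says that a gap remains after accounting for education. It is not by itself evidence of discrimination, because other variables such as age are left out. Race could be added to the formula in the same way.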
Now we will start with story No-2: